Search results for "Single-linkage clustering"
showing 10 items of 17 documents
Quantum clustering in non-spherical data distributions: Finding a suitable number of clusters
2017
Quantum Clustering (QC) provides an alternative approach to clustering algorithms, several of which are based on geometric relationships between data points. Instead, QC makes use of quantum mechanics concepts to find structures (clusters) in data sets by finding the minima of a quantum potential. The starting point of QC is a Parzen estimator with a fixed length scale, which significantly affects the final cluster allocation. This dependence on an adjustable parameter is common to other methods. We propose a framework to find suitable values of the length parameter σ by optimising twin measures of cluster separation and consistency for a given cluster number. This is an extension of the Se…
Fast dendrogram-based OTU clustering using sequence embedding
2014
Biodiversity assessment is an important step in a metagenomic processing pipeline. The biodiversity of a microbial metagenome is often estimated by grouping its 16S rRNA reads into operational taxonomic units or OTUs. These metagenomic datasets are typically large and hence require effective yet accurate computational methods for processing.In this paper, we introduce a new hierarchical clustering method called CRiSPy-Embed which aims to produce high-quality clustering results at a low computational cost. We tackle two computational issues of the current OTU hierarchical clustering approach: (1) the compute-intensive sequence alignment operation for building the distance matrix and (2) the …
Structural clustering of millions of molecular graphs
2014
We propose an algorithm for clustering very large molecular graph databases according to scaffolds (i.e., large structural overlaps) that are common between cluster members. Our approach first partitions the original dataset into several smaller datasets using a greedy clustering approach named APreClus based on dynamic seed clustering. APreClus is an online and instance incremental clustering algorithm delaying the final cluster assignment of an instance until one of the so-called pending clusters the instance belongs to has reached significant size and is converted to a fixed cluster. Once a cluster is fixed, APreClus recalculates the cluster centers, which are used as representatives for…
Incrementally Assessing Cluster Tendencies with a~Maximum Variance Cluster Algorithm
2003
A straightforward and efficient way to discover clustering tendencies in data using a recently proposed Maximum Variance Clustering algorithm is proposed. The approach shares the benefits of the plain clustering algorithm with regard to other approaches for clustering. Experiments using both synthetic and real data have been performed in order to evaluate the differences between the proposed methodology and the plain use of the Maximum Variance algorithm. According to the results obtained, the proposal constitutes an efficient and accurate alternative.
A Greedy Algorithm for Hierarchical Complete Linkage Clustering
2014
We are interested in the greedy method to compute an hierarchical complete linkage clustering. There are two known methods for this problem, one having a running time of \({\mathcal O}(n^3)\) with a space requirement of \({\mathcal O}(n)\) and one having a running time of \({\mathcal O}(n^2 \log n)\) with a space requirement of Θ(n 2), where n is the number of points to be clustered. Both methods are not capable to handle large point sets. In this paper, we give an algorithm with a space requirement of \({\mathcal O}(n)\) which is able to cluster one million points in a day on current commodity hardware.
Efficient and Accurate OTU Clustering with GPU-Based Sequence Alignment and Dynamic Dendrogram Cutting.
2015
De novo clustering is a popular technique to perform taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). In this work, we introduce a new dendrogram-based OTU clustering pipeline called CRiSPy. The key idea used in CRiSPy to improve clustering accuracy is the application of an anomaly detection technique to obtain a dynamic distance cutoff instead of using the de facto value of 97 percent sequence similarity as in most existing OTU clustering pipelines. This technique works by detecting an abrupt change in the merging heights of a dendrogram. To produce the output dendrograms, CRiSPy employs the OTU hierarchical clusterin…
Clustering categorical data: A stability analysis framework
2011
Clustering to identify inherent structure is an important first step in data exploration. The k-means algorithm is a popular choice, but K-means is not generally appropriate for categorical data. A specific extension of k-means for categorical data is the k-modes algorithm. Both of these partition clustering methods are sensitive to the initialization of prototypes, which creates the difficulty of selecting the best solution for a given problem. In addition, selecting the number of clusters can be an issue. Further, the k-modes method is especially prone to instability when presented with ‘noisy’ data, since the calculation of the mode lacks the smoothing effect inherent in the calculation …
Distance-constrained data clustering by combined k-means algorithms and opinion dynamics filters
2014
Data clustering algorithms represent mechanisms for partitioning huge arrays of multidimensional data into groups with small in–group and large out–group distances. Most of the existing algorithms fail when a lower bound for the distance among cluster centroids is specified, while this type of constraint can be of help in obtaining a better clustering. Traditional approaches require that the desired number of clusters are specified a priori, which requires either a subjective decision or global meta–information knowledge that is not easily obtainable. In this paper, an extension of the standard data clustering problem is addressed, including additional constraints on the cluster centroid di…
Scalable Clustering by Iterative Partitioning and Point Attractor Representation
2016
Clustering very large datasets while preserving cluster quality remains a challenging data-mining task to date. In this paper, we propose an effective scalable clustering algorithm for large datasets that builds upon the concept of synchronization. Inherited from the powerful concept of synchronization, the proposed algorithm, CIPA (Clustering by Iterative Partitioning and Point Attractor Representations), is capable of handling very large datasets by iteratively partitioning them into thousands of subsets and clustering each subset separately. Using dynamic clustering by synchronization, each subset is then represented by a set of point attractors and outliers. Finally, CIPA identifies the…
Paradigm of tunable clustering using Binarization of Consensus Partition Matrices (Bi-CoPaM) for gene discovery
2013
Copyright @ 2013 Abu-Jamous et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Clustering analysis has a growing role in the study of co-expressed genes for gene discovery. Conventional binary and fuzzy clustering do not embrace the biological reality that some genes may be irrelevant for a problem and not be assigned to a cluster, while other genes may participate in several biological functions and should simultaneously belong to multiple clusters. Also, these algorithms cannot generate tight cluster…